Automatic structuring of text files

Authors

  • GERARD SALTON
  • CHRIS BUCKLEY
  • JAMES ALLAN
Abstract

SUMMARY In many practical information retrieval situations, it is necessary to process heterogeneous text databases that vary greatly in scope and coverage, and deal with many different subjects. In such an environment it is important to provide flexible access to individual text pieces, and to structure the collection so that related text elements are identified and appropriately linked. Methods are described in this study for the automatic structuring of heterogeneous text collections, and the construction of browsing tools and access procedures that facilitate collection use. The proposed methods are illustrated by performing searches with a large automated encyclopedia.
In conventional information retrieval environments, documents are accessed by constructing large, so-called inverted, indexes containing all the distinct text words (except for some throwaway words), together with lists of document references that identify the documents, or document excerpts, in which the given text words occur. Information is retrieved by formulating Boolean queries consisting of text words interrelated by Boolean operators, consulting the corresponding lists of document references in the index, and identifying all documents that contain the proper combination of query terms.
The retrieval technology using inverted term indexes together with Boolean query formulations has the considerable advantage that the identification of documents containing the required query term combination is extremely rapid. In general the responses are available in a matter of seconds, even when the file contains several million documents. Moreover, the retrieval effectiveness may be relatively high because the documents considered for retrieval, corresponding to entries in the appropriate reference lists in the index, are known in advance to contain at least one of the required search terms.
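
The inverted-index lookup described above can be illustrated with a minimal sketch. This is not the authors' system: the stop list, the three toy documents, and the function names (`tokenize`, `build_inverted_index`, `boolean_and`, `boolean_or`) are all assumptions made for illustration. It shows only the general idea of mapping each distinct word to a list of document references and answering a Boolean query by combining those lists.

```python
from collections import defaultdict

# Hypothetical stop list standing in for the "throwaway words".
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def build_inverted_index(documents):
    """Map each distinct word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def boolean_and(index, terms):
    """Ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Ids of documents that contain at least one query term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


# Toy collection (invented for this example).
documents = {
    1: "Inverted indexes support rapid Boolean retrieval",
    2: "Boolean queries combine text words with operators",
    3: "Browsing tools link related text elements",
}
index = build_inverted_index(documents)

print(sorted(boolean_and(index, ["Boolean", "retrieval"])))  # [1]
print(sorted(boolean_or(index, ["Boolean", "browsing"])))    # [1, 2, 3]
print(sorted(boolean_and(index, ["Boolean", "browsing"])))   # []
```

Because each query only touches the reference lists of its own terms, the cost depends on the length of those lists rather than on the size of the whole collection, which is one plausible reading of why such lookups are described as rapid.
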
On the negative side, the conventional technology does not offer browsing capabilities, because the files are maintained in random rather than subject matter order. Occasionally users are turned off when errors in the Boolean query formulations produce questionable responses. For example, in answer to the query "Esau and Jacob" (the two biblical characters in the Old Testament), an automated encyclopedia may retrieve the encyclopedia article on


Similar resources

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Semi automatic indexing of PostScript files using Medical Text Indexer in medical education.

At Albert Einstein College of Medicine a large part of online lecture materials contain PostScript files. As the collection grows it becomes essential to create a digital library to have easy access to relevant sections of the lecture material that is full-text indexed; to create this index it is necessary to extract all the text from the document files that constitute the originals of the lect...

Biogeography-Based Optimization Algorithm for Automatic Extractive Text Summarization

Given the increasing number of documents, sites, online sources, and the users’ desire to quickly access information, automatic textual summarization has caught the attention of many researchers in this field. Researchers have presented different methods for text summarization as well as a useful summary of those texts including relevant document sentences. This study select...

A symbolic approach to automatic multiword term structuring

This paper presents a three-level structuring of multiword terms (MWTs) based on lexical inclusion, WordNet similarity and a clustering approach. Term clustering by automatic data analysis methods offers an interesting way of organizing a domain’s knowledge structures, useful for several information-oriented tasks like science and technology watch, text mining, computer-assisted ontology popula...

Automatic continuity of almost multiplicative maps between Frechet algebras

For Fréchet algebras $(A, (p_n))$ and $(B, (q_n))$, a linear map $T: A \rightarrow B$ is \textit{almost multiplicative} with respect to $(p_n)$ and $(q_n)$, if there exists $\varepsilon \geq 0$ such that $q_n(Tab - Ta\,Tb) \leq \varepsilon\, p_n(a)\, p_n(b),$ for all $n \in \mathbb{N}$, $a, b \in A$, and it is called \textit{weakly almost multiplicative} with respect to $(p_n)$ and $(q_n)$...

Publication date: 1992